Small but visible: Predicting rare bryophyte distribution and richness patterns using remote sensing-based ensembles of small models

In Canadian boreal forests, bryophytes represent an essential component of biodiversity and play a significant role in ecosystem functioning. Despite their ecological importance and sensitivity to disturbances, bryophytes are overlooked in conservation strategies due to knowledge gaps on their distribution, which is known as the Wallacean shortfall. Rare species deserve priority attention in conservation as they are at a high risk of extinction. This study aims to elaborate predictive models of rare bryophyte species in Canadian boreal forests using remote sensing-derived predictors in an Ensemble of Small Models (ESMs) framework. We hypothesize that high ESMs-based prediction accuracy can be achieved for rare bryophyte species despite their low number of occurrences. We also assess if there is a spatial correspondence between rare and overall bryophyte richness patterns. The study area is located in western Quebec and covers 72,292 km2. We selected 52 bryophyte species with <30 occurrences from a presence-only database (214 species, 389 plots in total). ESMs were built from Random Forest and Maxent techniques using remote sensing-derived predictors related to topography and vegetation. Lee’s L statistic was used to assess and map the spatial relationship between rare and overall bryophyte richness patterns. ESMs yielded poor to excellent prediction accuracy (AUC > 0.5) for 73% of the modeled species, with AUC values > 0.8 for 19 species, which confirmed our hypothesis. In fact, ESMs provided better predictions for the rarest bryophytes. Likewise, our study revealed a spatial concordance between rare and overall bryophyte richness patterns in different regions of the study area, which have important implications for conservation planning. This study demonstrates the potential of remote sensing for assessing and making predictions on inconspicuous and rare species across the landscape and lays the basis for the eventual inclusion of bryophytes into sustainable development planning.


Introduction
Canadian boreal forests represent 24% of the world's boreal forest [1]. In these forests, anthropogenic disturbances pose serious threats for boreal flora [2,3]. This is particularly true for sensitive plant species such as bryophytes, which have been recognized as reliable indicators of environmental changes [4][5][6]. Bryophytes are key constituents of biodiversity in Canadian boreal forests, promoting species richness [7,8] and supporting important ecosystem functions [8][9][10].
Forest management pressure is however affecting bryophyte diversity and community composition in the boreal biome, either through direct species removal or by altering habitat conditions originally suitable for bryophytes [11]. Forestry practices are also reducing the ecological continuity of forests, jeopardizing the recolonization processes after disturbance events [4,12]. Highly habitat-specific and/or dispersal-limited bryophyte species harbored by old-growth boreal forests may therefore be at risk [12]. Despite their ecological importance and sensitivity to disturbances, bryophytes are part of the vast unseen biodiversity that is currently ignored in most conservation plans [13,14].
Less known and represented in natural history collections than other groups such as birds, mammals or flowering plants, the large contribution of inconspicuous taxonomic groups to diversity is difficult to assess, and thus commonly operationalized using diversity measures of these other groups as surrogates [15,16]. However, these better-known taxonomic groups are poor surrogates for highly diverse but less showy or studied taxa [17]. Including inconspicuous species groups, such as bryophytes (e.g. [18]), representativeness in systematic conservation planning assessments would lead to more robust conservation measures [19].
From a conservation perspective, rare species deserve priority attention as they are at a high risk of extinction [20,21]. However, because of their own nature, many rare species of unseen biodiversity groups [19] suffer from a lack of information on environmental requirements or their distribution [22,23]. Species Distribution Models (SDMs), which allow to quantify the statistical relationships between species observations and environmental conditions from known locations, can provide useful tools for assessing ecological preferences of rare species or predicting their distributions [24,25]. More precisely, SDM-based predictions are achieved by using the relevant environmental conditions as proxies of species occurrence. However, the ability of traditional SDMs to predict rare species has been strongly limited by the number of occurrences available, with increases in prediction accuracy with increased sample size [26,27]. Furthermore, modeling species with low prevalence often results in a high predictors/ occurrences ratio, which can lead to model overfitting and reduced applicability to new data [28,29]. Fortunately, recent advances in modeling techniques and approaches such as Ensembles of Small Models (ESMs) have been shown to provide robust predictions for rare plants [20,28,30]. ESMs are ensembles of bivariate models generated from all pairwise predictor combinations from a larger set of predictors [20,28]. ESMs can produce more accurate predictions than traditional SDMs and reduce model overfitting for rare species [28]. In parallel, remote sensing (RS) offers a powerful tool to derive and integrate environmental information into SDMs and generate predictions on species distribution over large areas [18,31,32]. Although a considerable number of studies have successfully integrated RS predictors into SDMs [33][34][35], no study has generated ESMs using only RS predictors, nor has used this approach to generate SDMs of inconspicuous organisms such as bryophytes, much less of their rare species.
In this paper we use RS-derived predictors in an ESMs framework to produce predictive models of rare bryophyte species in Eastern Canadian boreal forests. Bryophyte rare species were selected based on their prevalence in the study area (<30 occurrences; [36]). This rare species selection approach was chosen because of the lack of knowledge on bryophytes related to their distribution, ecological preferences and abundance in the region [36], which make it difficult to apply more informative approaches such as multicriteria rare species classification methods (e.g. [37]). In fact, the most complete rare bryophyte species list published to date for the region used species' prevalence as the only criterion for rare species classification [38][39][40]. It should be noted that rare bryophytes from [38-40] were not targeted here as their low prevalence (�5 occurrences) greatly restricts the development of SDMs. We hypothesize that high ESMs-based prediction accuracy can be achieved for rare bryophyte species despite their low number of occurrences [28]. Our specific objectives are to assess i) if there is a relationship between the number of occurrences and the predictive performance of ESMs, ii) if the predictive performance of models varies by the modeled bryophyte guild (mosses, liverworts and sphagna), and iii) if there is a spatial relationship between the richness patterns of rare bryophyte species and overall bryophyte species both for bryophytes as a whole and at the guild level [18]. A total of 52 rare bryophyte species were targeted in the present study, including 33 mosses, 14 liverworts and 5 sphagna.

Bryophyte field data set
We used a 389-plot database of presences-only including the field data from three studies previously conducted in our study area [41][42][43], which integrated young, mature and old-growth forests and both recent fires and cut-blocks. The study area of 72,292 km 2 is located in the southwest of the Nord-du-Québec administrative region of western Quebec (48˚51' to 504 2'N and 74˚31' to 79˚26'W; Fig 1), within the Black spruce-feathermoss forest bioclimatic domain [44]. Natural dynamics of these forests are primarily driven by stand-replacing fires, whose cycle has been estimated at 398 years after 1920 [45]. The region is characterized by a flat topography, dominance of poorly drained clay soils and a moderately humid and cold climate (927.8 mm annual precipitation and 1.0˚C annual mean temperature) [46]. These conditions favor the accumulation of organic layer between fires, which is known as the paludification process [47,48].
Bryophytes were collected following a "floristic habitat sampling" method, which consists in collecting all bryophytes found in all microhabitats within 5 x 10 m plots [49]. Rare bryophyte species were selected based on their prevalence within the study area (<30 occurrences) [36]. From an initial set of 214 species, 142 rare species were pre-selected, and among them, only those with a minimum of 5 occurrences were retained for modeling, since meaningful predictions can be achieved at this sample size [50-52]. A total of 52 rare bryophyte species (33 mosses, 14 liverworts and 5 sphagna; S1 Table in S1 Appendix) were finally selected for modeling (species occurrence coordinates are shown in S1 Table).

Remote sensing environmental predictors
The selection of RS-derived predictors was carried out based on their sensitivity to environmental factors known to influence bryophyte distribution, namely topography, canopy cover and structure, and vegetation and soil moisture [18,[53][54][55]. Climatic variables were not included due to their coarse spatial resolution (� 1 km) and low spatial variability across the study area (annual mean temperature and total precipitation with an approximate variability range of 1˚C and 150 mm respectively [18]), which could lead us to overestimate the distribution of rare species. In addition, the climatic variability that could be integrated into the individual models of our rare species would be even more limited by the low number of available occurrences. It should be noted that climate variables also present lower reliability compared to RS variables at the scale of our study. This is because climatic variables are based on interpolation methods with high uncertainty, especially in northern latitudes where weather stations are scarce, while RS information is spatially continuous by nature. Therefore, we selected RS variables showing higher variability across the study area and capable of detecting changes in local conditions more closely related to bryophyte occurrence.
RS-derived environmental data were acquired using Google Earth Engine (GEE) [56]. The initial set of 6 predictors included topographic position index (TPI), 2-band enhanced vegetation index (EVI2), normalized difference water index (NDWI1), vegetation continuous fields (VCF), PALSAR HV/HH polarization index (PALSAR_HVHH), and bare soil index (BSI; see Table 1 for predictor descriptions). TPI was derived from the Shuttle Radar Topography Mission (SRTM) digital elevation model in ArcGIS v.10.5 [57] using an annulus neighborhood with inner and outer radius of 15 and 20 pixels, respectively. EVI2, NDWI1, and BSI predictors were derived from Sentinel-2 spectral bands. For each band, a mosaic was built from the images available for the summer season (July 1-August 31) between 2015-2019 to ensure homogeneity in the reflectance values [58]. Cloudy pixels were masked in all selected images using the Sentinel-2 QA60 band, which allows to identify pixels with opaque clouds and cirrus clouds. Mosaics were performed by applying the median of the overlapping pixel values. We chose EVI2 instead of EVI since EVI2 does not require the blue band, which is sensitive to the presence of residual clouds and aerosols [59]. VCF represents percent tree cover at 30 m resolution, after rescaling the 250 m MODIS VCF Tree Cover layer using circa-2010 and 2015 Landsat images and incorporating the MODIS Cropland Layer to improve accuracy in agricultural areas (https://catalog.data.gov/dataset/global-forest-cover-change-tree-cover-multi-year-global-30m-v003) [60]. The VCF predictor presented pixels (0.1% of the total) with missing values in the study area. PALSAR_HVHH was calculated as the ratio of HV-polarized to HH-polarized Lbands from the Advanced Land Observing Satellite (ALOS) Phased Arrayed L-band Synthetic Aperture Radar (SAR) [61]. HV-polarized and HH-polarized L-bands were averaged from yearly mosaics between 2015 and 2017. All predictors were generated and standardized at a 30 m spatial resolution (see Table 1 for original spatial resolutions). Pearson correlation coefficient was used to identify pairs of highly correlated predictors (|r|)> 0.7) from a set of 10,000 random background points. Only the NDWI1-BSI predictor pair showed a high correlation (r = -0.87). We retained NDWI1 which is sensitive to vegetation and soil moisture [62], since bryophytes are poikilohydric organisms whose distribution is highly dependent on available moisture [63,64]. This resulted in a final set of 5 uncorrelated predictors to run the models (Table 1).

Modeling approach: Ensembles of small models
ESMs based on bivariate models were developed to spatially predict 52 rare bryophyte species (5-29 occurrences) using two modeling machine-learning techniques: Maxent [69] and Random Forest (RF) [70]. Both Maxent and RF techniques can provide robust predictions when few occurrences are available [50, 71,72]. Maxent estimates the probability distribution for a given species by finding the probability distribution of maximum entropy according to a set of constraints representing the input known locations [69]. RF uses a bootstrap aggregation technique to provide mean predictions from a multitude of independent decision trees built from randomly selected subsamples from the training dataset [70]. A random subset of candidate predictors is assessed to split each node of each individual tree, selecting the predictor that provides the most information in each case [73]. ESMs were generated in R v.3.6.3 [74] using the biomod2 package v.3.4.6 [75]. As we used presence-only data, 10,000 background points were randomly generated within the study area and used as pseudo-absences for all species. Presences and pseudo-absences were weighted equally for training the ESMs. The pairwise combinations of our 5 final predictors resulted in 10 candidate bivariate models per modeling technique (Maxent and RF) for each species. We used default settings of the biomod2 package for computing Maxent and RF models. Predictive performance of each bivariate model was assessed via 10-fold cross-validation procedure, using 80% of the data to train the model and 20% for its validation. While we acknowledge that validation would be optimal using an external dataset, this is hardly available when dealing with rare species. The Somers' D metric was used to identify and select bivariate models better than random (Somers' D score > 0, i.e. AUC > 0.5). Maxent-ESMs and RF-ESMs were then performed using a weighted mean of predicted probabilities from their corresponding retained bivariate models based on their Somers' D scores [20,28]. The contribution of each bivariate model was thus proportional to its predictive accuracy. The final ESMs selected for each species was generated by weighted averaging predictions from Maxent-ESMs and RF-ESMs. Predictive performance of final ESMs was evaluated using the area under the receiver operating characteristic curve (AUC), and the true skills statistic (TSS). AUC is not dependent on a threshold and ranges from 0.5 for an uninformative model to 1 for a perfect fit model, while TSS ranges from -1 to 1 and was chosen instead of kappa because it is not affected by prevalence [76]. Since AUC and TSS values were highly correlated (Pearson r > 0.95), the results and discussion on models' overall predictive performance will be based on the AUC statistics, following [28] and [76]. The statistic sensitivity was also calculated, which allows the assessment of the proportion of actual presences correctly predicted [77]. We computed sensitivity for those species whose final ESMs were better than random (AUC > 0.5). Besides of the continuous models (values 0-1000), we generate binary models (presence/absence) using the maximum training sensitivity plus specificity threshold, or TSS optimum (Fig 2; predictive mapping of the distribution of the target species is available in [78]). Finally, we mapped the richness patterns (species number) for total rare bryophyte species, as well as for rare species by guild, by stacking their binary predictions (presence/absence). Missing values associated with the predictions of the three species that included the VCF predictor in their final models were classified as absences before richness computation. We then compared the spatial richness patterns obtained here for rare species with those obtained recently for overall bryophyte species in a smaller region (28,436 km 2 ) but fully included in our study area at the same spatial resolution (30 m) [18]. The comparison was performed for bryophytes as a whole (i.e. rare bryophyte richness versus overall bryophyte richness), and between homologous bryophyte guild pairs. This spatial correspondence analysis was carried out using Lee's L statistic [79] through the lee() function from the spdep package v.1.1-5. [80]. Lee's L statistic, in contrast to non-spatial bivariate association measures such as Pearson's correlation coefficient, integrates and corrects for the spatial autocorrelation of each variable when computing the pixel-to-pixel spatial correlation [79]. Due to the high computational requirements to carry out this analysis, the 30 m pixels were previously averaged into 300 m pixels through the aggregate() function of the raster package v.3.4-5 [81]. Outputs of lee() function were centered at 0 and re-scaled to -1 and 1 to facilitate the interpretation of the results by subtracting the overall mean and dividing by the maximum value [82]. We then calculated, for each pixel, the quantile associated with its Lee's L value using a Monte Carlo test with 999 simulations in order to identify significant positive (quantile >0.975) or negative (quantile <0.025) spatial associations.

Species traits characterization
Species traits can influence the accuracy and therefore the ability of SDMs to predict their occurrence [83,84]. We evaluated the relationship between ESMs' model performance, as measured by AUC, and rare species traits, namely substrate preference (six categories), reproduction mode (three categories), and spore size (maximum and minimum; S1 Table in S1 Appendix), as well as their interactions. This assessment was performed using a multiple linear regression through the lm() function from the stats package v.3.6.3 [74]. Relationships were considered significant at α = 0.05.

ESMs' predictive performance versus number of occurrences and bryophyte guilds
RS-based ESMs provided poor to excellent predictive accuracy for 38 of the 52 modeled rare species, with AUC values ranging from 0.551 to 0.979 and a mean AUC (mAUC) of 0.795 ± 0.132. Of these 38 species, 19 species were predicted with AUC values greater than 0.8, confirming our hypothesis that high ESMs-based prediction accuracy can be achieved for rare bryophyte species despite their low number of occurrences (<30). Sensitivity for these 38 species ranged from 0.8 to 1 with an average of 0.959 ± 0.063, indicating that actual presences were usually accurately predicted. Only predictions for 14 species were not better than random (AUC � 0.5). Regarding our first specific objective, a negative correlation (Pearson r = -0.34) was found between the number of occurrences of the 52 target species and the predictive accuracy as measured by AUC. This negative correlation was also observed at the guild level (Fig 3).
To accomplish our second specific objective, we grouped the 52 modeled species by guild and found that predictive accuracy was similar for mosses (mAUC = 0. 715 ± 0.167) and liverworts (mAUC = 0.735 ± 0.185), and lower for sphagna (0.663 ± 0.208). No significant relationships were found between ESMs' performance and rare species traits (or their interactions).

Richness patterns of rare bryophyte species
Predictive mapping of richness patterns of total rare bryophyte species and rare species at the guild level (mosses, liverworts and sphagna) are presented in Fig 4. Predicted richness values ranged from 0 to 30, 21, 9, and 3 species, respectively. The richness pattern of total rare bryophytes was largely structured by the similar richness patterns observed for rare mosses and liverworts, with high richness values mostly found towards the center and southwest of the study area. Conversely, rare sphagna species were concentrated in very specific areas mainly towards the north of the study area with two additional spots towards the southeast.
Regarding our third specific objective, the Lee's L statistic identified areas of significant positive and negative spatial association between rare and overall species richness for the four homologous bryophyte group pairs (Fig 5). Large areas in which the spatial association between the two types of richness was not significant were also consistently observed across pairs.

Discussion
Boreal regions are large areas lacking sharp environmental contrasts, as shown by the low variability of our predictors (Fig 6), and thus a habitat where obtaining high-performance SDMs can be challenging. Despite this, our ESMs provided reasonably accurate predictions for rare bryophytes using only 5 uncorrelated RS predictors. Specifically, RS-based ESMs provided poor to excellent predictive accuracy for 73% of the target species despite their very low number of occurrences. Indeed, 16 species with less than 10 occurrences showed an AUC > 0.7. In addition, the computation of the metric sensitivity allowed us to independently show the ability of our ESMs to accurately predict known presences, with high values for the 38 species modeled better than random. Therefore, the combination of RS data at 30 m spatial resolution and ESMs proved to be a powerful approach to predict the distribution of rare bryophyte species in Eastern Canadian boreal forests.
The negative relationship found between models' predictive performance and the number of occurrences of all bryophytes, as well as at the guild level (Fig 3), illustrated the suitability of ESMs for predicting the distribution of very rare bryophyte species regardless of guild. This result agrees with those obtained in [28], who showed a higher predictive performance of ESMs for the rarest vascular plants. Regarding bryophyte species by guild, we consider that the lower overall predictive performance obtained for sphagna species compared to that of mosses and liverworts may be an artifact resulting from the low number of rare sphagna species modeled (n = 5). In fact, the occurrences of two of these five sphagna species were successfully predicted (AUC values of 0.76 and 0.97). However, we do not exclude the possibility that some ecologically meaningful variables that describe the habitat of these species, such as drainage class [18], were missing from our models.
In general, our results show that the development of SDMs from RS data allows not only to make predictions of rare species distribution at spatial scales relevant to ecological planning, but also to do so at a level of detail (30 m resolution) that can not be achieved using the traditionally used climatic variables at coarse resolutions (� 1 km). This is particularly important for inconspicuous species such as bryophytes, which interact with their environment at more local scales [85][86][87] and for which the use of coarse resolutions can result in a critical lose of information. Likewise, SDMs developed at coarse resolutions can overestimate species distribution [88] and greatly limits the practical utility of derived predictions to subsequently detect species in the field [89]. On the other hand, the wide variety of potentially relevant predictors for rare plants that can be derived from RS (related to vegetation, humidity, forest structure, topography, etc.) [90], can allow a more realistic approach to the environment-species relationship, which can be particularly useful for species with complex ecological niches. Thus, our methodology can play an important role in filling existing knowledge gaps on bryophyte distribution ranges, as well as their ecological preferences, in largely unexplored regions such as boreal forests [36].
The Identification of diversity hotspots has been one of the most used criteria in biodiversity conservation planning in order to locate areas of biological and ecological interest that should be prioritized by decision makers [91][92][93]. Conservation measures targeting these areas will be more effective if multiple components of biodiversity are spatially concentrated [92,94,95]. Specifically, both species richness and the presence of rare species have frequently been cited as the main criteria to select areas for conservation [96,97], while many rare species might not be represented in species-rich areas [94]. Our study however revealed a spatial concordance between the richness of overall bryophyte species and that of their rare taxa in different regions of the study area (Fig 5). While more bryophyte biodiversity components could be subsequently evaluated, this result have important implications for Canadian conservation planning. We consider that the identification of areas harboring high level of both overall and rare bryophyte species diversity, as well as the development of informative tools that serve these purposes, is a significant and necessary step to promote the systematic integration of these species into conservation plans and programs [91]. Likewise, conservation planning targeting bryophytes and other inconspicuous taxa could further benefit from individual SDMsbased predictions as a basis for assessing their representation in nature reserve networks [98], to quantify the impact of land use changes on their distribution ranges [99], to inform assessments of their conservation status [100,101], and to identify suitable areas for their recovery or reintroduction [102].

Conclusions
Our work demonstrates the ability for RS data to characterize the habitat of rare bryophyte species and predict their distribution patterns across the landscape. This study also reaffirms the effectiveness of ESMs in estimating rare plant distributions [20,28], and highlights, for the first time, the suitability of this modeling approach for making predictions of inconspicuous rare species. We consider that our methods and results provide an important advance in the application of techniques focused on the study of bryophytes, with potential valuable applications for their management and conservation. In fact, although our study focuses on a particular taxonomic group, the combined use of ESMs and RS would lend useful results for other overlooked inconspicuous taxa lacking information on distribution, which would facilitate their integration in systematic conservation planning.
Supporting information S1